On this page we detail the quality control (QC) pipeline for for the BipEx dataset. Further plots, the underlying code and a document summarising the pipeline can be found on the BipEx github repository.

We first summarise the collection of samples, splitting across cohorts and subtypes. In addtition to Bipolar cases, we also a collection of Schizophrenia cases that will serve as positive controls for our PTV burden analyses.

LOCATION Controls Bipolar Disorder Schizoaffective Schizophrenia Other Unknown Total
Aberdeen, UK 331 0 0 564 0 1 896
Amsterdam, NED 1,611 1,212 21 1 58 17 2,920
Baltimore, USA 126 380 0 0 8 0 514
Boston, USA 3,498 3,449 52 0 0 0 6,999
Cambridge, UK 2,851 0 0 0 0 0 2,851
Cardiff, UK 1,106 2,442 68 2,990 18 0 6,624
Dublin, IRE 9 180 11 29 3 0 232
Edinburgh, UK 64 885 6 304 0 0 1,259
London, UK 1,203 1,909 157 1,595 0 0 4,864
Stockholm, SWE 5,541 5,160 1 0 0 0 10,702
Umea, SWE 459 472 0 0 0 0 931
Wurzburg, GER 414 397 0 0 0 14 825
Total 17,213 16,486 316 5,483 87 32 39,617

Within the collection of Bipolar cases, we have subtype information: Bipolar 1, and Bipolar 2. Further splitting the cases and labelling Bipolar cases for whom we do not have subtype information available, we obtain the following numbers of samples in each subcategory:

LOCATION Controls Bipolar Disorder 1 Bipolar Disorder 2 Bipolar Disorder Schizoaffective Total
Aberdeen, UK 331 0 0 0 0 331
Amsterdam, NED 1,611 1,032 169 10 21 2,843
Baltimore, USA 126 358 9 13 0 506
Boston, USA 3,498 2,122 390 937 52 6,999
Cambridge, UK 2,851 0 0 0 0 2,851
Cardiff, UK 1,106 1,518 772 152 68 3,616
Dublin, IRE 9 180 0 0 11 200
Edinburgh, UK 64 368 114 403 6 955
London, UK 1,203 1,309 372 227 157 3,268
Stockholm, SWE 5,541 2,364 1,753 1,043 1 10,702
Umea, SWE 459 320 149 3 0 931
Wurzburg, GER 414 216 159 22 0 811
Total 17,213 9,787 3,887 2,810 316 34,013

Finally, the split by PI was as follows:

PI Controls Bipolar Disorder 1 Bipolar Disorder 2 Bipolar Disorder Schizoaffective Schizophrenia Other Unknown
Andreas Reif 414 216 159 22 0 0 0 14
Andrew McQuillin 1,203 1,309 372 227 157 1,595 1 0
Bob Yolken 126 117 9 13 0 0 8 0
Danielle Posthuma 948 0 0 0 0 0 0 1
David St Clair 331 0 0 0 0 564 0 1
Derek Morris 9 180 0 0 11 29 3 0
Douglas Blackwood 64 368 114 403 6 304 0 0
Fernando Goes 0 241 0 0 0 0 0 0
Jordan Smoller 3,498 2,122 390 937 52 0 0 0
Michael ODonovan 0 0 0 0 11 2,986 1 0
Michael Owen 1,106 0 0 0 0 0 0 0
Mikael Landen 761 2,364 1,753 1,043 1 0 0 0
Nancy Pedersen 4,780 0 0 0 0 0 0 0
Nick Craddock 0 1,518 772 152 57 4 17 0
Roel Ophoff 663 1,032 169 10 21 1 51 24
Rolf Adolfsson 459 320 149 3 0 0 0 0
Willem Ouwehand 2,851 0 0 0 0 0 0 0
Total 17,213 9,787 3,887 2,810 316 5,483 81 40

For our QC pipeline, we first read in the .vcf file, split multiallelics, and remove sites with more than 6 alleles. After splitting muliallelics in the .vcf file containing 29,911,479 variants and restricting to these sites, we have 37,344,246 variants.


Initial genotype filtering

Our first step (after conversion of the joint called .vcf file to a hail matrix table) is to remove genotypes based on the following collection of criteria:

  • If homozygous reference, remove if at least one of the following is true:
    • Genotype quality \(<\) 20
    • Depth \(<\) 10
  • If heterozygous, at least one of the following is true:
    • (Reference allele depth + alternative allele depth) divided by total depth \(<\) 0.8
    • Alternative allele depth divided by total depth \(<\) 0.2
    • Reference phred-scaled genotype posterior \(<\) 20
    • Depth \(<\) 10
  • If homozygous variant, at least one of the following is true:
    • Alternative allele depth divided by total depth \(<\) 0.2
    • Reference phred-scaled genotype posterior \(<\) 20
    • Depth \(<\) 10


Initial variant filtering

Remove variants that either:

  • Fall in a low complexity region
  • Fail VQSR
  • Fall outside padded target intervals (50bp padding)
  • Are invariant after the initial GT filter


Filter Variants %
Variants with > 6 alleles 37,344,246 100.0
Failing VQSR 100,742 0.3
In LCRs 1,215,218 3.3
Outside padded target interval 27,119,165 72.6
Invariant sites after initial variant and genotype filters 3,117,961 8.3
Variants after initial filter 6,829,373 18.3

Initial sample quality control

Filter Samples %
Initial samples 39,592 100.0
Unknown phenotype 32 0.1
Low coverage or high contamination 133 0.3
Samples after initial filter 39,427 99.6

We run the sample_qc function in hail and remove samples according to the following:

  • Sample call rate \(<\) 0.93
  • FREEMIX contamination (%) \(>\) 0.02
  • Percentage chimeras (%) \(>\) 0.015
  • Mean depth \(<\) 30
  • Mean genotype quality \(<\) 55

Thresholds used were based on plotting the distributions of these metrics. A full collection of plots can be found in the repository. Here we show boxplots with overlaid scatterplots of the above metrics, split by sequencing batch, and coloured by location. The threshold for exclusion is shown as a dashed line.

Filter Samples %
Initial samples 39,427 100.0
Sample call rate < 0.93 185 0.5
% FREEMIX contamination > 0.02 267 0.7
% chimeric reads > 0.015 152 0.4
Mean DP < 30 20 0.1
Mean GQ < 55 56 0.1
Samples after sample QC filters 38,871 98.6

Following this step, we export high quality variants (allele frequency between 0.01 to 0.99 with high call rate (> 0.98)) to plink format and prune to pseudo-independent SNPs using --indep 50 5 2. This pruned set of SNPs feeds into the next few stages of the QC pipeline.


Sex imputation

We impute the sexes of the individuals with this pruned set of variants on the X chromosome, and create list of samples with incorrect or unknown sex as defined by:

  • Sex is unknown in the phenotype files
  • F-statistic \(>\) 0.6 and the sex is female in the phenotype file
  • F-statistic \(<\) 0.6 and the sex is male in the phenotype file

Here we show the distribution of the F-statistic, with the 0.6 threshold defining our sex impututation shown as a dashed line.

Filter Samples %
Initial samples 39,427 100.0
Samples with sex swap 238 0.6
Samples after sex swap removal 38,633 98.0


IBD

Using the identity_by_descent method in hail, we evaluate \(\hat{\pi}\) between pairs of samples, and filter based on a threshold of 0.2 shown as a dashed line on the plot below.

We then create a sample list of patients such that no pair has \(\hat{\pi} >\) 0.2.

Filter Samples %
Initial samples 39,427 100.0
Related samples for removal 1,716 4.4
Samples after IBD removal 38,633 94.2


PCA

We next perform a number of principal component analysis (PCA) steps to ensure that we have matched cases and controls in our cleaned dataset.


Initial PCA

We first run PCA on samples after removing relateds and those that passed initial QC, using the pruned set of variants.


PCA including 1000 genomes

Next, we included the 1000 genomes samples (minus the small subset of related individuals within 1000 geneomes), and rerun PCA after including those individuals. Plots of the first six principal components are shown below. 1000 genomes samples are coloured in dark blue.

We restrict to the European subset of individuals to perform analysis. To do this, we train a random forest on the super populations labels of 1000 genomes and predict the super population that each of the BipEx samples. We denote strictly defined European subset as those with probability \(>\) 0.95 of being European according to the classifier. BipEx samples are coloured by their assignment or unsure if none of the classifier probabilities exceeded 0.95 in the following plots.

Filter Samples %
Initial samples 37,187 100.0
Non-European, strict filter 2,451 6.6
Non-European, loose filter 1,214 3.3
Samples after strict European filter 34,736 93.4

Samples not assigned to the European cluster were removed from downstream analysis.

In addition, using a much looser definition of European, we restrict to US samples from MGH and Johns Hopkins, and run PCA. This enabled us to identify Ashkenazi Jewish clusters, and create a list of outliers (AJ or otherwise) for downstream removal or independent analysis.

Run also ran a further collection of PCAs on:

  • Strictly defined Europeans and Ashkenazi Jewish individuals
    • Use Ashkenazi Jewish cluster to train a random forest and determine if there are further Ashkenazi Jews in the remainder of the dataset.
  • Strictly defined Europeans
Filter Samples %
Initial samples 37,187 100.0
Non-European, loose filter 1,214 3.3
Ashkenazi Jewish samples 463 1.2
Samples after loose European filter and AJs removed 35,510 95.5

However, upon restriction to the European cluster and after removal of AJs, we find that we have a dense case-control matched collection of samples and so decide not to analyse Swedes, Finns and Europeans (excluding Finns and Swedes) separately.


Final variant filtering

For our final variant filtering step, we first restrict to samples in the strictly defined European subset, filter to the unrelated list, and filter out samples with incorrectly defined sex or unknown sex, and run variant QC. We then evaluate a collection of variant metrics and remove variants that satisfy at least one of:

  • Invariant site in cleaned sample subset
  • Call rate \(<\) 0.97
  • Control call rate \(<\) 0.97
  • Case call rate \(<\) 0.97
  • \(|\)Case call rate - Control call rate\(| >\) 0.02
  • \(p\)-value for Hardy Weinberg Equilibrium \(<\) 10-6

The following plots show the 0.97 threshold for call rate and 0.02 threshold for difference in call rate between cases and controls respectively.

Filter Variants %
Variants after initial filter 6,829,373 100.0
Invariant sites after sample filters 1,051,421 15.4
Overall variant call rate < 0.97 737,072 10.8
Overall variant case call rate < 0.97 716,709 10.5
Overall variant control call rate < 0.97 743,659 10.9
Difference between case and control variant call rate < 0.02 232,341 3.4
Variants failing HWE filter 1,083,479 15.9
Variants after filters 5,104,759 74.7

After these steps we plot the resulting changes in metrics across the samples in our data set. Each of the following plots splits the data by sequencing data and colours the points based on location. The first collection of subplots in each figure shows the variant metrics before sample removal, with the lower collection of subplots showing the resultant change after our QC steps.


Final sample filtering

In this step we remove sample outliers after the variant cleaning in the previous step. Samples are removed if at least one of the following lies more that three standard deviations away from the mean:

  • Ratio of transitions to transversions
  • Ratio of heterozygous to homozygous variant
  • Ratio of insertions to deletions
  • Number of singletons
Filter Samples %
Samples after population filters 34,298 100.0
Within batch Ti/Tv ratio outside 3 standard deviations 100 0.3
Within batch Het/HomVar ratio outside 3 standard deviations 150 0.4
Within batch Insertion/Deletion ratio outside 3 standard deviations 93 0.3
Within location n singletons outside 3 standard deviations 443 1.3
Samples after filters 33,527 97.8

As a final step, we export common (allele frequency between 0.01 and 0.99) variants to plink format, prune, and evaluate final principal components for downstream analysis. The first six principal components are displayed below and coloured by case status.


After all of this data cleaning, we save the resultant hail matrix tables for downstream analyses.

The resultant composition of the samples was as follows:

LOCATION Controls Bipolar Disorder Schizoaffective Schizophrenia Other Unknown Total
Aberdeen, UK 322 0 0 521 0 1 844
Amsterdam, NED 1,359 1,116 19 1 57 17 2,569
Baltimore, USA 41 267 0 0 4 0 312
Boston, USA 2,544 2,434 31 0 0 0 5,009
Cambridge, UK 2,656 0 0 0 0 0 2,656
Cardiff, UK 1,006 2,108 65 2,489 17 0 5,685
Dublin, IRE 7 150 11 27 2 0 197
Edinburgh, UK 58 711 6 271 0 0 1,046
London, UK 1,082 1,731 144 1,476 0 0 4,433
Stockholm, SWE 4,530 4,609 1 0 0 0 9,140
Umea, SWE 426 441 0 0 0 0 867
Wurzburg, GER 391 366 0 0 0 12 769
Total 14,422 13,933 277 4,785 80 30 33,527

The bipolar subtype information of the curated samples is:

LOCATION Controls Bipolar Disorder 1 Bipolar Disorder 2 Bipolar Disorder Schizoaffective Total
Aberdeen, UK 322 0 0 0 0 322
Amsterdam, NED 1,359 951 155 9 19 2,493
Baltimore, USA 41 254 6 7 0 308
Boston, USA 2,544 1,503 279 652 31 5,009
Cambridge, UK 2,656 0 0 0 0 2,656
Cardiff, UK 1,006 1,301 681 126 65 3,179
Dublin, IRE 7 150 0 0 11 168
Edinburgh, UK 58 317 94 300 6 775
London, UK 1,082 1,169 350 211 144 2,956
Stockholm, SWE 4,530 2,095 1,595 919 1 9,140
Umea, SWE 426 297 141 3 0 867
Wurzburg, GER 391 201 145 20 0 757
Total 14,422 8,238 3,446 2,247 277 28,630